DEPARTMENT OF STATISTICS 
University of Wisconsin 
1300 University Ave. 
Madison, WI 53706 



TECHNICAL REPORT NO. 1173 
March 22, 2013 

Statistical Model Building, Machine Learning, and the Ah-Ha Moment

Grace Wahba

Department of Statistics, Department of Computer Sciences 
and Department of Biostatistics and Medical Informatics 
University of Wisconsin, Madison 



Research supported in part by NIH Grant EY09946 and NSF Grant DMS-0906818. Prepared for the
volume "Past, Present and Future of Statistics, in Celebration of the 50th Anniversary of the Committee of
Presidents of Statistical Societies (COPSS)".



Preamble 



As a recipient of the Elizabeth Scott Award, I received the following invitation from 
David Scott and COPSS: 

"The Committee of Presidents of Statistical Societies (COPSS) will celebrate its 50th 
Anniversary in 2013. As part of its celebration, COPSS intends to publish a book with 
contributions from the past recipients of its four awards, namely the Fisher Lecture Award, 
the President's Award, the Elizabeth Scott Award, and the FN David Award. The theme of 
the book is Past, Present and Future of Statistical Science. As a past winner of one of the 
COPSS awards, we would like to invite you to contribute to this book... I will be working 
with you personally on your contribution ... Specifically, we are seeking contributions along 
(one of) the following lines: 

1. Statistical Research: Your reflection on the past, present or future of a statistical 
research area of your choice. You are free to decide what you would like to focus on, 
i.e., past, present or future, or possibly all three. 

2. Statistical Education: Your reflection/view on statistical education in some areas, 
e.g., BIG DATA, interdisciplinary research, graduate and undergraduate curriculum, 
promoting statistical education and research in developing countries, or statistical ed- 
ucational outreach. 

3. Statistical Career: Your reflection on your own career, lessons and experience you have 
learned, and advice you would like to provide to young statisticians if sought. 

4. A blend of the above three topics."
(signed by David Scott representing COPSS). 

I have chosen to focus on Item 3, reflecting on some particular "fun" pieces of my own 
statistical career, sprinkled with advice, and augmented with a few assorted remarks. Fol- 
lowing is my contribution. 






Chapter 1 



Statistical Model Building, Machine 
Learning, and the Ah-Ha Moment 

[March 22, 2013 Chapter by Grace Wahba for the Committee of Presidents of Statistical 
Societies (COPSS) 50th anniversary volume] 

Highly selected "Ah-Ha" moments from the beginning to the present of my research
career are recalled. These are moments when the main idea just popped up instantaneously,
sparking sequences of future research activity; almost all of these moments crucially involved
discussions/interactions with others. Along with a description of these moments we give
unsought advice to young statisticians. We conclude with remarks on issues relating to
statistical model building/machine learning in the context of human subjects data.

1.1 Introduction - Manny Parzen and RKHS

Many of the "Ah-Ha" moments below involve Reproducing Kernel Hilbert Spaces (RKHS) 
so we begin there. My introduction to RKHS came while attending a class given by Manny 
Parzen on the lawn in front of the old Sequoia Hall at Stanford around 1963. See [21].
For many years RKHS [1], [33] were a little niche corner of research which suddenly became
popular when their relation to Support Vector Machines (SVMs) became clear; more on that
later. To understand most of the Ah-Ha moments it may help to know a few facts about 
RKHS which we now give. 

An RKHS is a Hilbert space $\mathcal{H}$ where all of the evaluation functionals are bounded linear
functionals. What this means is the following: Let the domain of $\mathcal{H}$ be $\mathcal{T}$, and the inner
product $\langle \cdot, \cdot \rangle$. Then, for each $t \in \mathcal{T}$ there exists an element, call it $K_t$ in $\mathcal{H}$, with the
property $f(t) = \langle f, K_t \rangle$ for all $f$ in $\mathcal{H}$. $K_t$ is known as the representer of evaluation
at $t$. Let $K(s,t) = \langle K_s, K_t \rangle$; this is clearly a positive definite function on $\mathcal{T} \otimes \mathcal{T}$. By
the Moore-Aronszajn theorem, every RKHS is associated with a unique positive definite
function, as we have just seen. Conversely, given a positive definite function, there exists
a unique RKHS (which can be constructed from linear combinations of the $K_t, t \in \mathcal{T}$ and
their limits). Given $K(s,t)$ we denote the associated RKHS as $\mathcal{H}_K$. Observe that nothing
has been assumed concerning the domain $\mathcal{T}$. A second role of positive definite functions
is as the covariance of a zero mean Gaussian stochastic process on $\mathcal{T}$. In a third role
that we will come across later - let $O_i, i = 1, 2, \ldots, n$ be a set of $n$ abstract objects. An
$n \times n$ positive definite matrix can be used to assign pairwise squared Euclidean distances
$d_{ij}$ between $O_i$ and $O_j$ by $d_{ij} = K(i,i) + K(j,j) - 2K(i,j)$. In Sections 1.1.1-1.1.9 we
go through some Ah-Ha moments involving RKHS, positive definite functions and pairwise
distances/dissimilarities. Section 1.2 discusses sparse models and the LASSO. Section 1.3
has some remarks involving complex interacting attributes, the "Nature-Nurture" debate,
Personalized Medicine, Human subjects privacy and scientific literacy, and we end with
conclusions in Section 1.4.

I end this section by noting that Manny Parzen was my thesis advisor, and Ingram Olkin 
was on my committee. My main advice to young statisticians is: Choose your advisor and 
committee carefully, and be as lucky as I was. 

1.1.1 George Kimeldorf and the Representer Theorem 

Back around 1970 George Kimeldorf and I both got to spend a lot of time at the Math
Research Center at the University of Wisconsin-Madison (the one that later got blown up
as part of the anti-Vietnam-war movement). At that time it was a hothouse of spline work,
headed by Iso Schoenberg, Carl deBoor, Larry Schumaker and others, and we thought that
smoothing splines would be of interest to statisticians. The smoothing spline of order $m$
was the solution to: find $f$ in the space of functions with square integrable $m$th derivative to
minimize

$$\sum_{i=1}^n (y_i - f(t_i))^2 + \lambda \int_0^1 (f^{(m)}(t))^2 \, dt. \tag{1.1}$$

Professor Schoenberg many years ago had characterized the solution to this problem as a
piecewise polynomial of degree $2m - 1$ satisfying some boundary and continuity conditions.

Our Ah-Ha moment came when we observed that the space of functions with square
integrable $m$th derivative on $[0, 1]$ was an RKHS with seminorm $\|Pf\|$ defined by $\|Pf\|^2 =
\int_0^1 (f^{(m)}(t))^2 \, dt$ and with an associated $K(s,t)$ that we could figure out. (A seminorm is
exactly like a norm except that it has a non-trivial null space; here the null space of this
seminorm is the span of the polynomials of degree $m - 1$ or less.) Then by replacing $f(t)$ by
$\langle K_t, f \rangle$ it was not hard to show by a very simple geometric argument that the minimizer of
(1.1) was in the span of the $K_t, t = t_1, \ldots, t_n$ and a basis for the null space of the seminorm.
But furthermore, the very same geometrical argument could be used to solve the more general
problem: find $f \in \mathcal{H}_K$, an RKHS, to minimize

$$\sum_{i=1}^n C(y_i, L_i f) + \lambda \|Pf\|_K^2 \tag{1.2}$$

where $C(y_i, L_i f)$ is convex in $L_i f$, with $L_i$ a bounded linear functional in $\mathcal{H}_K$ and $\|Pf\|_K$
a seminorm in $\mathcal{H}_K$. A bounded linear functional is a linear functional with a representer in
$\mathcal{H}_K$, that is, there exists $\eta_i \in \mathcal{H}_K$ such that $L_i f = \langle \eta_i, f \rangle$ for all $f \in \mathcal{H}_K$. The minimizer
of (1.2) is in the span of the representers $\eta_i$ and a basis for the null space of the seminorm.
That is known as the representer theorem, which turned out to be a key to fitting (mostly 
continuous) functions in an infinite dimensional space, given a finite number of pieces of 
information. There were two things I remember about our excitement over the result: One 
of us, I'm pretty sure it was George, thought the result was too trivial and not worthwhile 
to submit, but submit it we did and it was accepted [12] without a single complaint, within 
three weeks. I have never since then had another paper accepted by a refereed journal within 
three weeks and without a single complaint. Advice: If you think it is worthwhile, submit 
it. 
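
The representer theorem can be made concrete with a small numerical sketch. The example below is an illustration under assumptions that are not in the text: squared-error loss, evaluation functionals $L_i f = f(t_i)$, a Gaussian kernel, and a penalty on the full RKHS norm (so the null space is trivial). Under those assumptions the minimizer of (1.2) is $f(t) = \sum_i c_i K(t, t_i)$ with $c = (K + \lambda I)^{-1} y$, so an infinite dimensional problem reduces to an $n \times n$ linear system. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(s, t, width=0.2):
    """K(s, t) = exp(-(s - t)^2 / (2 width^2)); the kernel choice is an assumption."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * width ** 2))

def fit_representers(t, y, lam, width=0.2):
    """Coefficients c of f = sum_i c_i K(., t_i) minimizing
    sum_i (y_i - f(t_i))^2 + lam * ||f||^2 (full-norm penalty, no null space)."""
    K = gaussian_kernel(t, t, width)
    return np.linalg.solve(K + lam * np.eye(len(t)), y)

def predict(t_new, t, c, width=0.2):
    """Evaluate the fitted function anywhere using only the n representers."""
    return gaussian_kernel(t_new, t, width) @ c

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(40)
lam = 0.1
c = fit_representers(t, y, lam)
f_hat = predict(t, t, c)
```

With a seminorm penalty, as in the smoothing spline, the same idea applies after adding a basis for the null space to the expansion.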

1.1.2 Svante Wold and Leaving-Out-One 

Following Kimeldorf and Wahba, it was clear that for practical use, a method was needed
to choose the smoothing or tuning parameter $\lambda$ in (1.1). The natural goal was to minimize
the mean square error over the function $f$, for which its values at the data points would
be the proxy. In 1974 Svante Wold visited Madison, and we got to mulling over how to
choose $\lambda$. It so happened that Mervyn Stone gave a colloquium talk in Madison, and Svante
and I were sitting next to each other as Mervyn described using leaving-out-one to decide
on the degree of a polynomial to be used in least squares regression. We looked at each
other at that very minute and simultaneously said something, I think it was "Ah-Ha", but
possibly "Eureka". In those days computer time was $600/hour and Svante wrote a computer
program to demonstrate that leaving-out-one did a good job. It took the entire Statistics
department's computer money for an entire month to get the results in [35]. Advice: Go to
the colloquia, sit next to your research pals.
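
As a rough illustration of leaving-out-one for choosing $\lambda$, the sketch below refits the model $n$ times for each candidate value and scores the prediction of the held-out point. It is not the program referred to above; it assumes a Gaussian kernel with a full-norm penalty in place of the spline penalty, synthetic data, and hypothetical function names.

```python
import numpy as np

def gram(s, t, width=0.2):
    """Gaussian kernel matrix between point sets s and t (an assumed kernel)."""
    return np.exp(-(s[:, None] - t[None, :]) ** 2 / (2.0 * width ** 2))

def loo_score(t, y, lam):
    """Mean squared error when each point is left out in turn and predicted
    from a fit to the remaining n - 1 points."""
    n = len(t)
    errors = np.empty(n)
    for k in range(n):
        keep = np.arange(n) != k
        K = gram(t[keep], t[keep])
        c = np.linalg.solve(K + lam * np.eye(n - 1), y[keep])
        pred = gram(t[k:k + 1], t[keep]) @ c
        errors[k] = y[k] - pred[0]
    return np.mean(errors ** 2)

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(30)
lams = 10.0 ** np.arange(-6, 3)          # candidate tuning parameters
scores = np.array([loo_score(t, y, lam) for lam in lams])
best_lam = lams[np.argmin(scores)]
```

The brute-force loop costs one linear solve per data point per candidate, which gives some feeling for why computer time was a concern.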

1.1.3 Peter Craven, Gene Golub and Michael Heath and GCV 

After much struggle to prove some optimality properties of leaving-out-one, it became clear
that it couldn't be done in general. Considering the data model $y = f + \epsilon$, where $y =
(y_1, \ldots, y_n)^T$, $f = (f(t_1), \ldots, f(t_n))^T$ and $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is a zero mean i.i.d. Gaussian
random $n$-vector, then the information in the data is unchanged by multiplying left and right
hand side by an orthogonal matrix, since $\Gamma\epsilon$ with $\Gamma$ orthogonal is still white Gaussian noise.
But leaving-out-one can give you a different answer. To explain, we define the influence
matrix: Let $f_\lambda$ be the minimizer of (1.1) when $C$ is sum of squares. The influence matrix
relates the data to the prediction of the data, $f_\lambda = A(\lambda)y$, where $f_\lambda = (f_\lambda(t_1), \ldots, f_\lambda(t_n))^T$.
A heuristic argument fell out of the blue, probably in an attempt to explain some things
to students, that rotating the data so that the influence matrix was constant down the
diagonal was the trick. The result was that instead of leaving-out-one, one should minimize
the GCV function $V(\lambda) = \|(I - A(\lambda))y\|^2 / (\mathrm{trace}(I - A(\lambda)))^2$ [4], [8]. I was on Sabbatical at Oxford in 1975 and
Gene was at ETH visiting Peter Huber, who had a beautiful house in Klosters, the fabled
ski resort. Peter invited Gene and me up for the weekend, and Gene just wrote out the
algorithm in [8] on the train from Zurich to Klosters while I snuck glances at the spectacular
scenery. Gene was a much loved mentor to lots of people. He was born on February 29, 1932
and died in November of 2007. On February 29 and March 1, 2008 his many friends held
memorial birthday services at Stanford and 30 other locations around the world. Ker-Chau
Li [17], [18], [19] and others later proved optimality properties of the GCV, and popular codes in
R will compute splines and other fits using GCV to estimate $\lambda$ and other important tuning
parameters. Advice: Pay attention to important tuning parameters since the results can be
very sensitive to them. Advice: Appreciate mentors like Gene if you are lucky enough to
have such great mentors.
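
A minimal sketch of the GCV criterion follows, again assuming a Gaussian kernel with a full-norm penalty so that the influence matrix is $A(\lambda) = K(K + \lambda I)^{-1}$; the kernel, the data and the function names are illustrative assumptions, and this is not the algorithm of [8]. The last lines check numerically the invariance that motivated GCV: rotating the data by an orthogonal matrix leaves $V(\lambda)$ unchanged.

```python
import numpy as np

def gram(t, width=0.2):
    """Gaussian kernel matrix on the points t (an assumed kernel)."""
    return np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * width ** 2))

def influence_matrix(K, lam):
    """A(lam) such that the fitted values are A(lam) y for the kernel ridge fit."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), K)

def gcv(A, y):
    """V = ||(I - A) y||^2 / (trace(I - A))^2."""
    resid = y - A @ y
    return (resid @ resid) / (len(y) - np.trace(A)) ** 2

rng = np.random.default_rng(2)
n = 40
t = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(n)
K = gram(t)
lams = 10.0 ** np.arange(-6, 3)
V = np.array([gcv(influence_matrix(K, lam), y) for lam in lams])
lam_gcv = lams[np.argmin(V)]

# Rotate the data model by a random orthogonal matrix Q: y -> Q y, A -> Q A Q^T.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = influence_matrix(K, lam_gcv)
V_plain = gcv(A, y)
V_rotated = gcv(Q @ A @ Q.T, Q @ y)
```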

1.1.4 Didier Girard, Mike Hutchinson, Randomized Trace and the
Degrees of Freedom for Signal

Brute force calculation of the trace of the influence matrix $A(\lambda)$ can be daunting
for large $n$. Let $f_\lambda^y$ be the minimizer of (1.1) with the data vector $y$ and let $f_\lambda^{y+\delta}$
be the minimizer of (1.1) given the perturbed data $y + \delta$. Note that $\delta^T(f_\lambda^{y+\delta} - f_\lambda^y) =
\delta^T(A(\lambda)(y + \delta) - A(\lambda)y) = \sum_{i,j=1}^n \delta_i \delta_j a_{ij}$, where $\delta_i$ and $a_{ij}$ are the components of $\delta$ and $A(\lambda)$
respectively. If the perturbations are i.i.d. with mean zero and variance 1, then this
sum is an unbiased estimate of trace $A(\lambda)$. This simple idea was proposed simultaneously in [6], [11],
with further theory in [7]. It was a big Ah-Ha when I saw these papers because further
applications were immediate. In [32], p 139, I defined the trace of $A(\lambda)$ as the "Equivalent
degrees of freedom for signal", by analogy with linear least squares regression with $p < n$
where the influence matrix is a rank $p$ projection operator. The degrees of freedom for
signal is an important concept in linear and nonlinear nonparametric regression, and it was
a mistake to hide it inconspicuously in [32]. Brad Efron later [3] gave an alternative definition
of degrees of freedom for signal. The definition in [32] depends only on the data; Efron's
is essentially an expected value. Note that in the model (1.1), trace $A(\lambda) = \sum_{i=1}^n \partial \hat{y}_i / \partial y_i$, where
$\hat{y}_i$ is the predicted value of $y_i$. This definition can reasonably be applied to a problem with
a nonlinear forward operator (that is, one that maps data onto the predicted data) when the
derivatives exist, and the randomized trace method is reasonable for estimating the degrees
of freedom for signal, although care should be taken concerning the size of $\delta$. Even when
the derivatives don't exist the randomized trace can be a reasonable way of getting at the
degrees of freedom for signal, see for example [34].
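
The randomized trace idea can be sketched in a few lines. The example below treats the fitting procedure as a black box `fit` that maps a data vector to fitted values, perturbs the data with random $\pm 1$ vectors (one choice of i.i.d. perturbations with mean zero and variance 1), and averages $\delta^T(f^{y+\delta} - f^{y})$. The kernel ridge smoother, the number of probes and all names are assumptions made for illustration.

```python
import numpy as np

def gram(t, width=0.2):
    """Gaussian kernel matrix on the points t (an assumed kernel)."""
    return np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * width ** 2))

def make_fit(K, lam):
    """Return the map y -> fitted values; the influence matrix is never formed."""
    M = K + lam * np.eye(K.shape[0])
    return lambda y: K @ np.linalg.solve(M, y)

def randomized_trace(fit, y, n_probes=400, size=1.0, seed=0):
    """Average of delta^T (fit(y + delta) - fit(y)) / size^2 over random probes.
    `size` controls the perturbation magnitude, which matters for nonlinear fits."""
    rng = np.random.default_rng(seed)
    base = fit(y)
    draws = []
    for _ in range(n_probes):
        delta = size * rng.choice([-1.0, 1.0], size=len(y))
        draws.append(delta @ (fit(y + delta) - base) / size ** 2)
    return float(np.mean(draws))

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.0, 1.0, 50))
y = np.sin(2 * np.pi * t) + 0.2 * rng.standard_normal(50)
K = gram(t)
lam = 0.1
trace_est = randomized_trace(make_fit(K, lam), y)
# Exact value, available here only because the smoother is linear and small.
trace_exact = np.trace(np.linalg.solve(K + lam * np.eye(50), K))
```

Because only calls to `fit` are needed, the same function can be pointed at a nonlinear procedure, with a smaller `size` when the fit is only locally linear.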

1.1.5 Yuedong Wang, Chong Gu and Smoothing Spline ANOVA 

Sometime in the late 80's or early 90's I heard Graham Wilkinson expound on ANOVA
(Analysis of Variance), where data was given on a regular $d$-dimensional grid: $y_{ijk}, t_{ijk}, i =
1, \ldots, I, j = 1, \ldots, J, k = 1, \ldots, K$, for $d = 3$ and so forth. That is, the domain is the Cartesian
product of several one-dimensional grids. Graham was expounding on how fitting a model
from observations on such a domain could be described as a set of orthogonal projections
based on averaging operators, resulting in main effects, two factor interactions, etc.
"Ah-Ha" I thought, we should be able to do exactly the same thing and more where the domain is
the Cartesian product $\mathcal{T} = \mathcal{T}_1 \otimes \mathcal{T}_2 \otimes \cdots \otimes \mathcal{T}_d$ of $d$ arbitrary domains. We want to fit functions
on $\mathcal{T}$, with main effects (functions of one variable), two factor interactions (functions of two
variables), and possibly more terms given scattered observations, and we just need to define
averaging operators for each $\mathcal{T}_\alpha$. Brainstorming with Yuedong Wang and Chong Gu fleshed
out the results. Let $\mathcal{H}_\alpha, \alpha = 1, \ldots, d$ be $d$ RKHSs with domains $\mathcal{T}_\alpha$, each $\mathcal{H}_\alpha$ containing the
constant functions. $\mathcal{H} = \mathcal{H}_1 \otimes \cdots \otimes \mathcal{H}_d$ is an RKHS with domain $\mathcal{T}$. For each $\alpha = 1, \ldots, d$,
construct a probability measure $d\mu_\alpha$ on $\mathcal{T}_\alpha$, with the property that the symbol $(\mathcal{E}_\alpha f)(t)$, the
averaging operator, defined by

$$(\mathcal{E}_\alpha f)(t) = \int_{\mathcal{T}_\alpha} f(t_1, \ldots, t_d) \, d\mu_\alpha(t_\alpha),$$

is well defined and finite for every $f \in \mathcal{H}$ and $t \in \mathcal{T}$. Consider the decomposition of the
identity operator:

$$I = \prod_\alpha (\mathcal{E}_\alpha + (I - \mathcal{E}_\alpha)) \tag{1.3}$$

This decomposition of the identity then always generates a unique (ANOVA-like) decomposition
of $f \in \mathcal{H}$ of the form

$$f(t) = \mu + \sum_\alpha f_\alpha(t_\alpha) + \sum_{\alpha<\beta} f_{\alpha\beta}(t_\alpha, t_\beta) + \sum_{\alpha<\beta<\gamma} f_{\alpha\beta\gamma}(t_\alpha, t_\beta, t_\gamma) + \cdots \tag{1.5}$$

where the expansion is unique and (usually) truncated in some manner in practice. Here
$\mu = (\prod_\alpha \mathcal{E}_\alpha) f$, $f_\alpha = ((I - \mathcal{E}_\alpha) \prod_{\beta \neq \alpha} \mathcal{E}_\beta) f$, $f_{\alpha\beta} = ((I - \mathcal{E}_\alpha)(I - \mathcal{E}_\beta) \prod_{\gamma \neq \alpha, \beta} \mathcal{E}_\gamma) f$, etc., are the
mean, main effects, two factor interactions, etc. The result is usually called an SS ANOVA
model, although the components are not limited to splines. For details on how to fit the
terms see [9], [10], [33], [36] and the assist and gss codes in R. Note that nothing has been said
about $\mathcal{T}$ and very little regarding $\mathcal{H}$, other than that the constant functions are in each of
the constituent spaces and averaging operators can be defined.
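
The decomposition is easy to see numerically in the simplest case. The sketch below assumes $d = 2$, a function tabulated on a regular grid, and uniform averaging measures on each axis; it only illustrates the averaging operators and does not fit an SS ANOVA model from scattered data. The function name and the test function are hypothetical.

```python
import numpy as np

def anova_decompose(F):
    """Split F[i, j] = f(t1_i, t2_j) into mean, main effects and interaction,
    with E_1 and E_2 taken as uniform averages over the first and second axis."""
    mu = F.mean()                              # E_1 E_2 f
    f1 = F.mean(axis=1) - mu                   # (I - E_1) E_2 f, a function of t1
    f2 = F.mean(axis=0) - mu                   # (I - E_2) E_1 f, a function of t2
    f12 = F - mu - f1[:, None] - f2[None, :]   # (I - E_1)(I - E_2) f
    return mu, f1, f2, f12

t1 = np.linspace(0.0, 1.0, 20)
t2 = np.linspace(0.0, 1.0, 30)
F = (1.0 + np.sin(2 * np.pi * t1)[:, None] + t2[None, :] ** 2
     + t1[:, None] * t2[None, :])
mu, f1, f2, f12 = anova_decompose(F)
F_rebuilt = mu + f1[:, None] + f2[None, :] + f12
```

Each component averages to zero over its own arguments, which is what makes the expansion unique.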

1.1.6 Vladimir Vapnik, the Mystery Caller and the SVM 

The AMS-IMS-SIAM Joint Summer Research Conference on Adaptive Selection of Models 
and Statistical Procedures was held on the campus of Mount Holyoke College in South 
Hadley, Massachusetts on Sunday, June 23 through Thursday, June 27, 1996. On one of 
those fine days a session met on a grassy lawn of Mount Holyoke College, when Vladimir 
Vapnik and I were both invited speakers. I talked first, and noted how the solution to the
optimization problem (1.2) led to a function involving the span of the representers. Vladimir
spoke next, describing the support vector machine (SVM), a well known and highly successful
method for classification, describing something he called the "kernel trick". He exhibited an
SVM that was fitted in the span of representers in an RKHS. We will explain the SVM in a
moment, but the original SVM, as proposed by Vapnik and coworkers [30], was derived from
an argument nothing like what I am about to give. Somewhere during Vladimir's talk, an
unknown voice towards the back of the audience called out "That looks like Grace Wahba's
stuff." It looked obvious that the SVM as proposed by Vapnik with the "kernel trick" could
be obtained as the solution to the optimization problem of (1.2) with $C(y_i, L_i f)$ replaced
by the so called hinge function, $(1 - y_i f(t_i))_+$, where $(\tau)_+ = \tau$ if $\tau > 0$ and $0$ otherwise. Each
data point is coded as $\pm 1$ according as it came from the "plus" class or the "minus" class.
For technical reasons the null space of the penalty function consists at most of the constant
functions. Thus it follows that the solution is in the span of the representers $K_{t_i}$ from the
chosen RKHS plus possibly a constant function. Yi Lin and coworkers [20], [21] showed that
the SVM was estimating the sign of the log odds ratio, just what is needed for two class
classification. The SVM may be compared to the case where one desires to estimate the
probability that an object is in the plus class. If one begins with the penalized log likelihood
of the Bernoulli distribution and codes the data as $\pm 1$ instead of the usual coding as 0 or
1, then we have the same optimization problem with $C(y_i, f(t_i)) = \log(1 + e^{-y_i f(t_i)})$ instead
of $(1 - y_i f(t_i))_+$ with solution in the same finite dimensional space, but it is estimating
the log odds-ratio, as opposed to the sign of the log odds ratio. It was actually a big deal
that the SVM could be directly compared with penalized likelihood with Bernoulli data,
and it provided a pathway for statisticians and computer scientists to breach a major divide
between them on the subject of classification, and to understand each others' work.
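
The contrast between the two loss functions can be checked with a small calculation. The sketch below fixes a point $t$ at which the plus class has probability $p$ (the value 0.8 is an arbitrary assumption) and finds, by a grid search over $f$, the value minimizing the expected loss. No kernel or penalty is involved; the names are hypothetical.

```python
import numpy as np

def hinge(y, f):
    """Hinge function (1 - y f)_+ ."""
    return np.maximum(0.0, 1.0 - y * f)

def logistic(y, f):
    """Bernoulli negative log likelihood with labels coded +1 / -1."""
    return np.log1p(np.exp(-y * f))

def population_minimizer(loss, p, grid):
    """Value of f on the grid minimizing the expected loss when P(y = +1) = p."""
    risk = p * loss(1.0, grid) + (1.0 - p) * loss(-1.0, grid)
    return grid[np.argmin(risk)]

grid = np.linspace(-3.0, 3.0, 601)
p = 0.8
f_hinge = population_minimizer(hinge, p, grid)     # lands at +1, the sign of the log odds
f_logit = population_minimizer(logistic, p, grid)  # lands near log(p / (1 - p))
log_odds = np.log(p / (1.0 - p))
```

The hinge minimizer carries only the sign of the log odds, while the logistic minimizer recovers its value, which is the distinction drawn above.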

For many years before the Hadley meeting, Olvi Mangasarian and I would talk about 
what we were doing in classification, neither of us having any understanding of what the 
other was doing. Olvi complained that the statisticians dismissed his work, but it turned 
out that what he was doing was related to the SVM and hence perfectly legitimate not to 
mention interesting, from a classical statistical point of view. Statisticians and computer 
scientists have been on the same page on classification ever since. 

It is curious to note that several patents have been awarded for the SVM. One of the
early ones, issued on July 15, 1997, is "5649068 Pattern recognition system using support
vectors".

I'm guessing that the unknown volunteer was David Donoho. 

Advice: Keep your eyes open to synergies between apparently disparate fields. 

1.1.7 Yoonkyung Lee, Yi Lin and the Multicategory SVM 

For classification, when one has k > 2 classes it is always possible to apply an SVM to 
compare membership in one class versus the rest of the k classes, running through the 
algorithm k times. In the early 2000s there were many papers on one-vs-rest, and designs 
for subsets vs other subsets, but it is possible to generate examples where essentially no 
observations will be identified as being in certain classes. Since one-vs-rest could fail in 
certain circumstances it was something of an open question how to do multicategory SVMs 
in one optimization problem that did not have this problem. Yi Lin, Yoonkyung Lee and 
I were sitting around shooting the breeze and one of us said "how about a sum-to-zero
constraint?" and the other two said "Ah-Ha!", or at least that's the way I remember it. The
idea is to code the labels as $k$-vectors, with a 1 in the $r$th position and $-1/(k-1)$ in the $k - 1$
other positions for a training sample in class $r$. Thus, each observation vector satisfies the
sum-to-zero constraint. The idea was to fit a vector of functions satisfying the same
sum-to-zero constraint. The multicategory SVM fit estimates $f(t) = (f_1(t), \ldots, f_k(t))$, $t \in \mathcal{T}$
subject to the sum-to-zero constraint everywhere and the classification for a subject with
attribute vector $t$ is just the index of the largest component of the estimate of $f(t)$. See
[14], [15], [16]. Advice: Shooting the breeze is good.
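
The label coding and the classification rule are simple enough to write down directly. The sketch below covers only those two pieces, not the multicategory SVM optimization itself; classes are indexed from 0 here as a programming convenience, and the function names are hypothetical.

```python
import numpy as np

def code_labels(labels, k):
    """Row i has a 1 in position labels[i] and -1/(k-1) in the other k-1 positions,
    so every row satisfies the sum-to-zero constraint."""
    Y = np.full((len(labels), k), -1.0 / (k - 1))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def classify(F):
    """Assign each row of fitted values (f_1(t), ..., f_k(t)) to its largest component."""
    return np.argmax(F, axis=1)

labels = np.array([0, 2, 1, 2])
Y = code_labels(labels, k=3)
row_sums = Y.sum(axis=1)
recovered = classify(Y)
```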

1.1.8 Fan Lu, Steve Wright, Sunduz Keles, Hector Corrada Bravo, 
and Dissimilarity Information 

We return to the alternative role of positive definite functions as a way to encode pairwise
distance observations. Suppose we are examining $n$ objects $O_i, i = 1, \ldots, n$ and are given
some noisy or crude observations on their pairwise distances/dissimilarities, which may not
satisfy the triangle inequality. The goal is to embed these objects in a Euclidean space in
such a way as to respect the pairwise dissimilarities as much as possible. Positive definite
matrices encode pairwise squared distances $d_{ij}$ between $O_i$ and $O_j$ as

$$d_{ij}(K) = K(i,i) + K(j,j) - 2K(i,j), \tag{1.6}$$

and, given a non-negative definite matrix of rank $d < n$, can be used to embed the $n$ objects
in a Euclidean space of dimension $d$, centered at 0 and unique up to rotations. We seek a $K$
which respects the dissimilarity information $d_{ij}^{\mathrm{obs}}$ while constraining the complexity of $K$ by:

$$\min_{K \in S_n} \sum_{i<j} |d_{ij}^{\mathrm{obs}} - d_{ij}(K)| + \lambda \, \mathrm{trace}\, K \tag{1.7}$$

where $S_n$ is the convex cone of symmetric positive definite matrices. I looked at this problem
for an inordinate amount of time seeking an analytic solution but after a conversation with
Vishy (S. V. N. Vishwanathan) at a meeting in Rotterdam in August of 2003 I realized it
wasn't going to happen. The Ah-Ha moment came about when I showed the problem to
Steve Wright, who right off said it could be solved numerically using recently developed
convex cone software. The result so far is [3], [22]. In [22] the objects are protein sequences
and the pairwise distances are BLAST scores. The fitted kernel K had three eigenvalues 
that contained about 95% of the trace, so we reduced K to a rank 3 matrix by truncating 
the smaller eigenvalues. Clusters of four different kinds of proteins were readily separated 
visually in three-d plots; see [22] for the details. In [3] the objects are persons in pedigrees 
in a demographic study and the distances are based on Malecot's kinship coefficient, which 
defines a pedigree dissimilarity measure. The resulting kernel became part of an SS ANOVA 
model with other attributes of persons, and the model estimates a risk related to an eye 
disease. Advice: Find computer scientist friends. 
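
The step from a fitted kernel matrix to coordinates can be sketched as follows. The example does not solve the convex cone problem (1.7); it assumes a non-negative definite $K$ is already in hand (here built from random centered points, purely for illustration) and shows the distance formula (1.6), the fraction of the trace captured by the leading eigenvalues, and the embedding obtained by truncating the smaller eigenvalues. Function names are hypothetical.

```python
import numpy as np

def squared_distances(K):
    """d_ij(K) = K(i,i) + K(j,j) - 2 K(i,j), as in (1.6)."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def embed(K, dim):
    """Coordinates in R^dim from the dim largest eigenvalues of K, together with
    the fraction of trace(K) they carry; exact when rank(K) <= dim."""
    vals, vecs = np.linalg.eigh(K)
    top = np.argsort(vals)[::-1][:dim]
    kept = np.clip(vals[top], 0.0, None)
    return vecs[:, top] * np.sqrt(kept), kept.sum() / np.trace(K)

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 3))
X -= X.mean(axis=0)                 # centered at 0
K = X @ X.T                         # rank 3 non-negative definite matrix
Z, trace_fraction = embed(K, dim=3)
D_from_K = squared_distances(K)
D_from_Z = squared_distances(Z @ Z.T)
```

The recovered coordinates `Z` agree with `X` only up to a rotation, but the pairwise distances are reproduced.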

1.1.9 Gabor Szekely, Maria Rizzo, Jing Kong and Distance Correlation

The last Ah-Ha experience that we report is similar to that involving the randomized trace
estimate of Section 1.1.4; that is, the Ah-Ha moment came about upon realizing that a
particular recent result was very relevant to what we were doing. In this case Jing Kong brought
to my attention the important paper of Gabor Szekely and Maria Rizzo [27]. Briefly, this
paper considers the joint distribution of two random vectors, $X$ and $Y$, say, and provides a
test, called distance correlation, of whether it factors, so that the two random vectors are
independent. Starting with $n$ observations from the joint distribution, let $\{A_{ij}\}$ be the collection of
double-centered pairwise distances among the $\binom{n}{2}$ $X$ components, and similarly for $\{B_{ij}\}$.
The statistic, called distance correlation, is the analogue of the usual sample correlation
between the $A$'s and $B$'s. The special property of the test is that it is justified for $X$ and $Y$ in
Euclidean $p$ and $q$ space for arbitrary $p$ and $q$ with no further distributional assumptions. In
a demographic study involving pedigrees [13], we observed that pairwise distance in death
age between close relatives was less than that of unrelated age cohorts. A mortality risk
score for four lifestyle factors and another score for a group of diseases was developed via
SS ANOVA modeling, and significant distance correlation was found between death ages,
lifestyle factors and family relationships, raising more questions than it answers regarding
the "Nature-Nurture" debate (relative role of genetics and other attributes).
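
A minimal sketch of the sample distance correlation follows, written from the standard definition (double-centered Euclidean distance matrices, averaged products, and the usual normalization). It omits the permutation test used to assess significance, and the simulated data and names are assumptions for illustration only.

```python
import numpy as np

def double_centered_distances(X):
    """Pairwise Euclidean distances with row means and column means removed
    and the grand mean added back."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    return (D - D.mean(axis=0, keepdims=True)
              - D.mean(axis=1, keepdims=True) + D.mean())

def distance_correlation(X, Y):
    """Sample distance correlation between paired observations X and Y."""
    A = double_centered_distances(X)
    B = double_centered_distances(Y)
    dcov2 = max((A * B).mean(), 0.0)           # guard against rounding below zero
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0

rng = np.random.default_rng(5)
x = rng.standard_normal(200)
y = x ** 2 + 0.1 * rng.standard_normal(200)   # dependent on x, but nearly uncorrelated
z = rng.standard_normal(200)                  # independent of x
dcor_xy = distance_correlation(x, y)
dcor_xz = distance_correlation(x, z)
dcor_xx = distance_correlation(x, x)
```

The quadratic example is chosen because the ordinary correlation between `x` and `y` is close to zero while the distance correlation is not.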

We take this opportunity to make a few important remarks about pairwise distances/dissimilarities:
primarily, how one measures them can be important, and getting the "right" dissimilarity
can be 90% of the problem. We remark that family relationships in [13] were based on a
monotone function of Malecot's kinship coefficient that was different from the monotone
function in [3]. Here it was chosen to fit in with the different way the distances were used.
In (1.7), the pairwise dissimilarities can be noisy, scattered, incomplete and could include
subjective distances like "very close, close..." etc., not even satisfying the triangle inequality.
So there is substantial flexibility in choosing the dissimilarity measure with respect to the
particular scientific context of the problem. In [13] the pairwise distances need to be a
complete set, and be Euclidean (with some specific metric exceptions). There is still substantial
choice in choosing the definition of distance, since any linear transformation of a Euclidean
coordinate system defines a Euclidean distance measure. Advice: Think about how you
measure distance or dissimilarity in any problem involving pairwise relationships; it can be
important.

1.2 Regularization Methods, RKHS and Sparse Models

The optimization problems in RKHS are a rich subclass of what can be called regularization
methods, which solve an optimization problem which trades fit to the data versus complexity
or constraints on the solution. My first encounter with the term "regularization" was [29] in
the context of finding numerical solutions to integral equations. There the $L_i$ of (1.2) were
noisy integrals of an unknown function one wishes to reconstruct, but the observations only
contained a limited amount of information regarding the unknown function. The basic and
possibly revolutionary idea at the time was to find a solution which involves fit to the data
while constraining the solution by what amounted to an RKHS seminorm, ($\int (f''(t))^2 \, dt$),
standing in for the missing information by an assumption that the solution was "smooth"
[23], [31]. Where once RKHS were a niche subject, they are now a major component of the
statistical model building/machine learning literature.

However, RKHS do not generally provide sparse models, that is, models where a large
number of coefficients are being estimated but only a small but unknown number are believed
to be non-zero. Many problems in the "Big Data" paradigm are believed to have, or want
to have, sparse solutions, for example, genetic data vectors that may have many thousands
of components and a modest number of subjects, as in a case-control study. The most
popular method for ensuring sparsity is probably the LASSO [2]. Here a very large
dictionary of basis functions $(B_j(t), j = 1, 2, \ldots)$ is given and the unknown function is
estimated as $f(t) = \sum_j \beta_j B_j(t)$ with the penalty functional $\lambda \sum_j |\beta_j|$ replacing an RKHS
square norm. This will induce many zeroes in the $\beta_j$, depending, among other things, on
the size of $\lambda$. Since then, researchers have commented that there is a "zoo" of proposed
variants of sparsity-inducing penalties, many involving assumptions on structures in the
data; one popular example is [38]. Other recent models involve mixtures of RKHS and
sparsity-inducing penalty functionals. One of our contributions to this "zoo" deals with the
situation where the data vectors amount to very large "bar codes", and it is desired to find
patterns in the bar codes relevant to some outcome. An innovative algorithm which deals
with a humongous number of interacting patterns assuming that only a small number of
coefficients are non-zero is given in [25], [26], [37].
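
To show how the $\lambda \sum_j |\beta_j|$ penalty produces exact zeroes, here is a small sketch that minimizes $\|y - B\beta\|^2 + \lambda \sum_j |\beta_j|$ by cyclic coordinate descent with soft thresholding. Coordinate descent is one common way to compute such a fit, chosen here for brevity; it is not claimed to be the algorithm of any of the cited papers. The dictionary is a random matrix standing in for the basis functions evaluated at the data, and all names and settings are assumptions.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, setting it exactly to zero if |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso(B, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for ||y - B beta||^2 + lam * sum_j |beta_j|."""
    n, p = B.shape
    beta = np.zeros(p)
    col_sq = (B ** 2).sum(axis=0)
    resid = y.copy()                           # residual for the current beta
    for _ in range(n_sweeps):
        for j in range(p):
            resid += B[:, j] * beta[j]         # remove column j's contribution
            rho = B[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= B[:, j] * beta[j]         # put the updated contribution back
    return beta

rng = np.random.default_rng(6)
B = rng.standard_normal((60, 200))             # many more coefficients than subjects
beta_true = np.zeros(200)
beta_true[[3, 50, 120]] = [3.0, -2.0, 1.5]
y = B @ beta_true + 0.1 * rng.standard_normal(60)
beta_hat = lasso(B, y, lam=20.0)
n_nonzero = np.count_nonzero(beta_hat)
```

Rerunning with a smaller `lam` lets more coefficients enter the model, which is the dependence on the size of $\lambda$ noted above.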

As is easy to see here and in the statistical literature, the statistical modeler has over- 
whelming choices in modeling tools, with many public codes available in the software repos- 
itory R and elsewhere. In practice these choices must be made with a serious understanding 
of the science and the issues motivating the data collection. Good collaborations with sub- 
ject matter researchers can lead to the opportunity to participate in real contributions to 
the science. Advice: Learn absolutely as much as you can about the subject matter of the 
data that you contemplate analyzing. When you use "black boxes" be sure you know what 
is inside them. 

1.3 Remarks on the Nature-Nurture Debate, Personalized
Medicine and Scientific Literacy



We and many other researchers have been developing methods for combining scattered,
noisy, incomplete, highly heterogeneous information from multiple sources with interacting
variables to predict, classify, and determine patterns of attributes relevant to a response, or
more generally multiple correlated responses.

Demographic studies, clinical trials, and ad hoc observational studies based on electronic
medical records, which have familial [3], [13], clinical, genetic, lifestyle, treatment and other
attributes, can be a rich source of information regarding the Nature-Nurture debate, as well
as informing Personalized Medicine, two popular areas reflecting much activity. As large
medical systems put their records in electronic form, interesting problems arise as to how to deal
with such unstructured data, to relate subject attributes to outcomes of interest. No doubt
a gold mine of information is there, particularly with respect to how the various attributes 
interact. The statistical modeling/machine learning community continues to create and im- 
prove tools to deal with this data flood, eager to develop better and more efficient modeling 
methods, and regularization and dissimilarity methods will no doubt continue to play an 
important role in numerous areas of scientific endeavor. With regard to human subjects 
studies, a limitation is the problem of patient confidentiality - the more attribute informa- 
tion available to explore for its relevance, the trickier the privacy issues, to the extent that 
de-identified data can actually be identified. It is important, however, that statisticians be 
involved from the very start in the design of human subjects studies. 

With health related research, the US citizenry has some appreciation of scientific results 
that can lead to better health outcomes. On the other hand any scientist who reads the 
newspapers or follows present day US politics is painfully aware that a non-trivial portion 
of voters and the officials they elect have little or no understanding of the scientific method. 
Statisticians need to participate in the promotion of increased scientific literacy in our edu- 
cational establishment at all levels. 

1.4 Conclusion 

In response to the invitation from COPSS to contribute to their 50th Anniversary
Celebration, I have taken a tour of some exciting moments in my career, involving RKHS and
regularization methods, pairwise dissimilarities and distances, and LASSO models, dispensing
unasked-for advice to new researchers along the way. I have made a few remarks concerning
the richness of models based on RKHS, as well as models involving sparsity-inducing
penalties, with some remarks involving the Nature-Nurture Debate and Personalized Medicine.
I end this contribution with thanks to my many coauthors, identified here or not, and to
my terrific present and former students. Advice: Treasure your collaborators! Have great
students!